Linear Model#
See the backing repository for Linear Model here.
Summary#
Linear / logistic regression, where the relationship between the response and its explanatory variables is modeled with linear predictor functions. This is one of the foundational models in statistical modeling: it trains quickly and offers good interpretability, but its predictive performance can vary. The implementation is a light wrapper around the linear / logistic regression exposed in scikit-learn.
How it Works#
Christoph Molnar’s “Interpretable Machine Learning” e-book [1] has an excellent overview of linear and logistic regression models that can be found here and here respectively.
For implementation-specific details, scikit-learn’s user guide [2] on linear and logistic regression models is solid and can be found here.
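As a minimal sketch of the idea (not the library's implementation), a linear model combines the features through a weighted sum plus an intercept, and logistic regression passes that sum through the sigmoid function to obtain a probability. The helper names, weights, intercept, and feature values below are made up for illustration.

```python
import math

def linear_predictor(x, weights, intercept):
    """Weighted sum of the features plus an intercept."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and a single instance, for illustration only.
weights = [0.5, -1.0]
intercept = 0.25
x = [1.0, 0.75]

score = linear_predictor(x, weights, intercept)  # what a linear regression would output
probability = sigmoid(score)                     # what a logistic regression would output
print(score, probability)
```

Each weight states how much the score changes per unit change in its feature, which is what makes these models easy to interpret.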
Code Example#
The following code will train a logistic regression for the breast cancer dataset. The visualizations provided will be for both global and local explanations.
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from interpret.glassbox import LogisticRegression
from interpret import show
seed = 42
np.random.seed(seed)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
lr = LogisticRegression(max_iter=3000, random_state=seed)
lr.fit(X_train, y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
print("AUC: {:.3f}".format(auc))
AUC: 0.998
show(lr.explain_global())
show(lr.explain_local(X_test[:5], y_test[:5]), 0)
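One common way to read a local explanation of a linear model is as per-feature contributions: each coefficient multiplied by the instance's feature value, with the intercept as a baseline. The sketch below assumes that decomposition and uses made-up numbers and a hypothetical helper; it is not taken from the library's source.

```python
def local_contributions(x, weights, intercept, feature_names):
    """Split a linear prediction into an intercept plus per-feature terms."""
    terms = {name: w * xi for name, w, xi in zip(feature_names, weights, x)}
    return intercept, terms

# Hypothetical fitted coefficients and one instance, for illustration only.
feature_names = ["mean radius", "mean texture"]
weights = [2.0, -0.5]
intercept = 1.0
x = [3.0, 4.0]

baseline, terms = local_contributions(x, weights, intercept, feature_names)
prediction = baseline + sum(terms.values())

# Rank features by absolute contribution, the order a bar chart might use.
ranked = sorted(terms.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

For logistic regression the same sum would be on the log-odds scale rather than the probability scale.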
Further Resources#
Bibliography#
[1] Christoph Molnar. Interpretable Machine Learning. Lulu.com, 2020.
[2] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
API#
LinearRegression#
- class interpret.glassbox.LinearRegression(feature_names=None, feature_types=None, linear_class=<class 'sklearn.linear_model._base.LinearRegression'>, **kwargs)#
Linear regression.
Currently a wrapper around linear models in scikit-learn: scikit-learn/scikit-learn
Initializes class.
- Parameters:
feature_names – List of feature names.
feature_types – List of feature types.
linear_class – A scikit-learn linear class.
**kwargs – Keyword arguments passed to the linear class at initialization time.
- explain_global(name=None)#
Provides global explanation for model.
- Parameters:
name – User-defined explanation name.
- Returns:
An explanation object, visualizing feature-value pairs as a horizontal bar chart.
- explain_local(X, y=None, name=None)#
Provides local explanations for provided instances.
- Parameters:
X – Numpy array for X to explain.
y – Numpy vector for y to explain.
name – User-defined explanation name.
- Returns:
An explanation object, visualizing feature-value pairs for each instance as horizontal bar charts.
- fit(X, y)#
Fits model to provided instances.
- Parameters:
X – Numpy array for training instances.
y – Numpy array as training labels.
- Returns:
Itself.
- predict(X)#
Predicts on provided instances.
- Parameters:
X – Numpy array for instances.
- Returns:
Predicted value per instance.
- score(X, y, sample_weight=None)#
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – \(R^2\) of self.predict(X) w.r.t. y.
- Return type:
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
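The \(R^2\) definition above can be checked by hand. The following is a minimal pure-Python sketch for a single output; `r_squared` is a hypothetical helper written for this illustration, not a library function, and it does not handle the degenerate case where the total sum of squares is zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - u / v, for a single output."""
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - u / v

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                 # predictions match exactly
constant = r_squared(y_true, [2.5] * len(y_true))   # always predict the mean of y
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])     # reversed predictions

print(perfect, constant, worse)
```

The three cases correspond to the best possible score, the constant-model baseline, and a model that does worse than that baseline.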
LogisticRegression#
- class interpret.glassbox.LogisticRegression(feature_names=None, feature_types=None, linear_class=<class 'sklearn.linear_model._logistic.LogisticRegression'>, **kwargs)#
Logistic regression.
Currently a wrapper around linear models in scikit-learn: scikit-learn/scikit-learn
Initializes class.
- Parameters:
feature_names – List of feature names.
feature_types – List of feature types.
linear_class – A scikit-learn linear class.
**kwargs – Keyword arguments passed to the linear class at initialization time.
- explain_global(name=None)#
Provides global explanation for model.
- Parameters:
name – User-defined explanation name.
- Returns:
An explanation object, visualizing feature-value pairs as a horizontal bar chart.
- explain_local(X, y=None, name=None)#
Provides local explanations for provided instances.
- Parameters:
X – Numpy array for X to explain.
y – Numpy vector for y to explain.
name – User-defined explanation name.
- Returns:
An explanation object, visualizing feature-value pairs for each instance as horizontal bar charts.
- fit(X, y)#
Fits model to provided instances.
- Parameters:
X – Numpy array for training instances.
y – Numpy array as training labels.
- Returns:
Itself.
- predict(X)#
Predicts on provided instances.
- Parameters:
X – Numpy array for instances.
- Returns:
Predicted class label per instance.
- predict_proba(X)#
Probability estimates on provided instances.
- Parameters:
X – Numpy array for instances.
- Returns:
Probability estimate of instance for each class.
- score(X, y, sample_weight=None)#
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of self.predict(X) w.r.t. y.
- Return type:
float
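To make the difference between plain accuracy and subset accuracy concrete, here is a minimal pure-Python sketch. `mean_accuracy` is a hypothetical helper for illustration only; in the multi-label case each sample's label set is represented as a tuple, so a sample counts as correct only when every label matches.

```python
def mean_accuracy(y_true, y_pred):
    """Fraction of samples whose prediction equals the true label (or label set)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Single-label case: three of four predictions are right.
single = mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0])

# Multi-label case: the second sample has one wrong label out of two,
# so the whole sample is counted as incorrect.
multi_true = [(1, 0), (0, 1), (1, 1)]
multi_pred = [(1, 0), (0, 0), (1, 1)]
subset = mean_accuracy(multi_true, multi_pred)

print(single, subset)
```

Five of the six individual labels in the multi-label example are correct, yet only two of the three samples count, which is why subset accuracy is described as a harsh metric.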